Our first GLM practice run will be on the Kentucky Arrow Darter data set, comparing fish sex to total length (mm).
A male Kentucky Arrow Darter that is approximately 75 mm in length (Left) and a female Kentucky Arrow Darter that is approximately 65mm in length (Right) Photos provided by Matt Thomas KFWS.
First let’s upload the dataset.
kad_tissue_data <- read.csv("~/Desktop/Analytics/assignment1glm/GLM-Practice/data/kad_tl_data.csv")
Convert categorical Male and Female data to binary data (0 and 1)
kad_tissue_data$sex_binary <- factor(kad_tissue_data$sex, levels=c("M", "F"), labels = c(0,1))
Now change that factor data to interger
str(kad_tissue_data)
## 'data.frame': 107 obs. of 4 variables:
## $ tissue_id : chr "APSU-KAD-500" "APSU-KAD-501" "APSU-KAD-502" "APSU-KAD-503" ...
## $ tl_mm : int 65 85 78 91 71 63 71 65 70 72 ...
## $ sex : chr "F" "M" "F" "M" ...
## $ sex_binary: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 2 1 ...
kad_tissue_data$num_sex_binary<-as.numeric(as.character(kad_tissue_data$sex_binary))
Let’s plot the data on a box and whisker plot to see if there are any obvious trends
ggplot(kad_tissue_data, aes(sex,tl_mm)) +
geom_boxplot(color="blue", fill="green4", alpha=0.2) +
geom_jitter(height = 0, width = .15)+
xlab ("Sex") +
ylab ("Total Length (mm)") +
labs(title="Influence of Kentucky Arrow Darter Sex on Total Length (mm)")
As we can see, males had a higher mean total length than females. This suggests that males may be typically bigger than females.
We need to now find a proper GLM for our dataset
pois1 <- glm(tl_mm~num_sex_binary, family= poisson(link=log), data= kad_tissue_data)
summary(pois1)
##
## Call:
## glm(formula = tl_mm ~ num_sex_binary, family = poisson(link = log),
## data = kad_tissue_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3298 -0.7556 -0.0348 1.0499 2.3012
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.30811 0.01641 262.583 < 2e-16 ***
## num_sex_binary -0.10028 0.02303 -4.355 1.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 184.74 on 106 degrees of freedom
## Residual deviance: 165.79 on 105 degrees of freedom
## AIC: 820.72
##
## Number of Fisher Scoring iterations: 4
the degrees of freedom and residual deviance are not similar, suggesting this model is not best fit
Let’s try another GLM
Let’s check to see if binomial distribution fits
binom1 <- glm(sex_binary~tl_mm, family= binomial, data= kad_tissue_data)
summary(binom1)
##
## Call:
## glm(formula = sex_binary ~ tl_mm, family = binomial, data = kad_tissue_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1275 -1.1139 0.7029 1.0436 1.5436
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.93252 1.51999 3.245 0.00117 **
## tl_mm -0.06778 0.02118 -3.201 0.00137 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 147.88 on 106 degrees of freedom
## Residual deviance: 135.83 on 105 degrees of freedom
## AIC: 139.83
##
## Number of Fisher Scoring iterations: 4
So it’s not a perfect fit, but better than poisson.
This is an autoplot to help us understand if it really fits the model
kad_tissue_data$tl_mm100 <- kad_tissue_data$tl_mm/100
fit.1 <- glm(num_sex_binary~tl_mm100, data=kad_tissue_data, binomial(link="logit"))
autoplot(fit.1)
Some of these don’t look great, but it is not awful, so we are going to take it.
Let’s try a binned plot
x <- predict(fit.1)
y <- resid(fit.1)
binnedplot(x, y)
Final binomial plot
ggplot(kad_tissue_data, aes(tl_mm,num_sex_binary)) +
geom_point(color="green4") +
geom_smooth(method="glm", method.args=list(family="binomial"(link="logit"))) +
xlab ("Total Length (mm)") +
ylab ("Probability of being Male (0) or Female (1)")
## `geom_smooth()` using formula 'y ~ x'
Binomial GLM is the best fit for our Kentucky Arrow data set examining the relationship between sex and total length (mm).
Now it’s time to move onto the next data set!
Acid mine drainage contaminating various ecosystems across the planet.
First, let’s import the dataset
water_chemistry <- read.csv("~/Desktop/Analytics/assignment1glm/GLM-Practice/data/water_chemistry.csv")
Now let’s do a raw plot to get the general idea of the data trends
ggplot(water_chemistry,aes(Fe,Cond)) +
geom_point(color="green4") +
geom_smooth(color="blue") +
xlab ("Iron") +
ylab ("Conductivity") +
labs(title="Raw Fit: How does iron effect conductivity?")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Let’s try to do a Poisson summary to see if that is a fitting model
pois2 <- glm(Fe~Cond, family= poisson(link=log), data= water_chemistry)
summary(pois2)
##
## Call:
## glm(formula = Fe ~ Cond, family = poisson(link = log), data = water_chemistry)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -66.637 -42.018 -29.247 0.295 162.800
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.381e+00 5.146e-03 1434.3 <2e-16 ***
## Cond 1.408e-03 3.074e-06 457.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287228 on 24 degrees of freedom
## Residual deviance: 70982 on 23 degrees of freedom
## AIC: Inf
##
## Number of Fisher Scoring iterations: 5
The residual deviance and degrees freedom have a large difference, so this is not a good model
Let’s try Gaussian
gaus2 <- glm(Fe~Cond, family= gaussian, data= water_chemistry)
summary(gaus2)
##
## Call:
## glm(formula = Fe ~ Cond, family = gaussian, data = water_chemistry)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5910.3 -1993.2 -107.3 466.3 10668.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -452.219 1042.662 -0.434 0.669
## Cond 12.915 1.251 10.326 4.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 17311224)
##
## Null deviance: 2244001448 on 24 degrees of freedom
## Residual deviance: 398158141 on 23 degrees of freedom
## AIC: 491.53
##
## Number of Fisher Scoring iterations: 2
It somehow got worse (there is even a larger difference between the residual deviance and DF)
Let’s plot Gaussian
ggplot(water_chemistry, aes(Fe,Cond)) +
geom_point() +
geom_smooth(method="glm", method.args=list(family="gaussian")) +
xlab ("Total Length (mm)") +
ylab ("Probability of being Male (0) or Female (1)")
## `geom_smooth()` using formula 'y ~ x'
The plot for Gaussian does not look awful, but I think we can do better
One more try, this time Gamma GLM
gam <- glm(Fe~Cond, family= Gamma(link="sqrt"), data= water_chemistry)
summary(gam)
##
## Call:
## glm(formula = Fe ~ Cond, family = Gamma(link = "sqrt"), data = water_chemistry)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4384 -0.5768 -0.2562 0.3429 0.9961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.14515 2.01703 8.004 4.25e-08 ***
## Cond 0.09333 0.01188 7.858 5.82e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.4210324)
##
## Null deviance: 71.737 on 24 degrees of freedom
## Residual deviance: 10.821 on 23 degrees of freedom
## AIC: 421.33
##
## Number of Fisher Scoring iterations: 6
Since the difference between residual deviance and DF is at the lowest we have seen for this dataset, this suggests that Gamma square root is the best fit model
last, but not least, let’s plot the Gamma square root GLM
ggplot(water_chemistry, aes(Fe,Cond)) +
geom_point() +
geom_smooth(method="glm", method.args=list(family="Gamma"(link ="sqrt"))) +
xlab ("Total Length (mm)") +
ylab ("Probability of being Male (0) or Female (1)")
## `geom_smooth()` using formula 'y ~ x'
Looks funky, but it’ll do?